ABSTRACT

The term "equating" refers to a statistical procedure that
adjusts test scores on different forms of the same examination so that scores
can be interpreted interchangeably. This study examines the impact of 
equating with fewer items than originally planned when items have been 
removed from the equating set for a variety of reasons. A real data set from
a licensure/certification examination was used for the study. Three linear 
equating methods and three test forms were involved. The sample size for each 
of the test forms was at least 10,000 examinees. Item sets were manipulated 
to discard different numbers of items. The scale scores computed using the 
decreased sets of equators were either unchanged or one point higher or lower 
than the scale scores computed using the original set of equators. The scale 
score fluctuations in nearly all of the decreased equating sets affected the 
passing scores of approximately two percent of the examinees, either 
increasing or decreasing the required scale score by one point. For a set in 
which two scale scores varied rather than just one, approximately four 
percent of the examinees were affected. Findings show that the equating plan 
designed for this examination was highly robust. (SLD) 
Presented at AERA 2002, New Orleans

Patricia L. Hanick and Chi-Yu Huang 1 
ACT, Inc. 

Background 

Equating is a statistical procedure that adjusts test scores on different forms of the same 
exam so that scores can be interpreted interchangeably. Equating enables comparisons of 
performance by different examinee groups, even though the groups were administered 
different forms of the test at different times. Many equating methods require exams to 
share a set of common items that are used for more than one administration. This 
common set of items provides statistical data to link the test forms for equating purposes. 

Licensure and certification examinations often are equated using a common item,
non-equivalent groups design. Because the set of common items is used in multiple test forms,
it is imperative that the items remain secure. If the examination is highly competitive, it is 
often difficult to prevent items from being compromised during a test administration. A
related and equally difficult situation arises when items must be discarded because they no
longer represent the content specifications of the total test, or have become invalid due to
changes in the body of knowledge. When the discarded items belong to the set of
equating items, licensure and certification policy makers may seek psychometric advice 
to help make informed decisions about how to proceed. 

Research Objective 

The purpose of this study is to address practical problems related to common item sets 
used for equating. More specifically, the study examines the impact of equating with 
fewer items than originally planned when items have been removed from the equating set 
for a variety of reasons. 

Researchers have examined the effect of the number of common items on the accuracy of 
equating. Kolen and Brennan (1995) stated that “the number of common items to use
should be considered on both content and statistical grounds” (p. 248). Klein and
Jarjoura (1985) compared content representative sets of common items with larger sets of
common items that were not content representative. They found that content 
representativeness of common item sets was "critical" to equating accuracy and 
concluded that the longer, non-representative common item sets produced less accurate 
equating results. Harris’ study (1991) later supported this conclusion. Gao, Hanson and 
Harris (1999) examined the effect of content and statistical non-representativeness on 
common item equating with non-equivalent groups. They found that content itself did not 
greatly impact equating results. However, if the common item set was not statistically 
representative, a content representative common item set may produce less equating error 
than a content non-representative set. 



1 To request copies of this paper, please contact either author at ACT, Incorporated, 2201 North Dodge
Street, P.O. Box 168, Iowa City, IA 52243-0168.



The present study was designed to examine the effect of diminishing content and 
statistical representativeness of common item sets relative to an original common item set 
when equating with two link forms. The focus was to provide useful information to 
persons in charge of testing programs confronted with the very real situation of what to 
do when equating items have been compromised. The outcomes of the study should 
provide guidance for how to proceed when test forms have been lost, breached or 
equating items discarded for other reasons. The study investigated: 

■ The effects of discarding 5 and 10 items from a set of 30 common items; 

■ The effects of discarding items that result in the common item set reflecting less 
accurately the statistical specifications of the total test; 

■ The effects of discarding items that result in the common item set reflecting less 
accurately the content areas of the total test. 

Methodology 

A real data set from a licensure/certification examination was used for this study. The 
exam consisted of 200 total items including 60 common items, ten for each of six content 
areas. The common items were divided evenly between two link exam forms with each 
form providing 30 items, 5 for each content area. When exams have been compromised 
in the past, typically only one of the two link exam forms has been affected. Therefore, 
for this study items were discarded from only one link form rather than both. 

Three linear equating methods were used (Tucker, Levine Observed Score, and Levine
True Score) and three test forms were involved: one new form and two equating link
forms. The sample size for each of the three test forms was at least 10,000 examinees.
The same data sets and equating methods were used for the various conditions under 
which items were discarded from the common item set. 
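
As a rough illustration of what one of these linear methods computes, the sketch below follows the Tucker method as it is commonly presented in equating textbooks. It is not the study's own code: the function and argument names are hypothetical, population (divide by n) variances are assumed, and the synthetic-population weights are assumed to be proportional to sample size.

```python
from statistics import fmean


def _var(values):
    """Population variance (divide by n)."""
    m = fmean(values)
    return fmean([(v - m) ** 2 for v in values])


def _cov(a, b):
    """Population covariance of two equal-length sequences."""
    ma, mb = fmean(a), fmean(b)
    return fmean([(x - ma) * (y - mb) for x, y in zip(a, b)])


def tucker_linear_equating(x1, v1, y2, v2, w1=None):
    """Return (slope, intercept) mapping new-form raw scores X onto the Y scale.

    x1, v1: total and common-item scores of the group that took the new form.
    y2, v2: total and common-item scores of the group that took the link form.
    w1: synthetic-population weight for group 1 (assumed default: n1 / (n1 + n2)).
    """
    n1, n2 = len(x1), len(y2)
    if w1 is None:
        w1 = n1 / (n1 + n2)
    w2 = 1.0 - w1

    # Regression slopes of total score on common-item score within each group.
    g1 = _cov(x1, v1) / _var(v1)
    g2 = _cov(y2, v2) / _var(v2)

    # Group differences on the common items drive the adjustment.
    d_mean = fmean(v1) - fmean(v2)
    d_var = _var(v1) - _var(v2)

    # Synthetic-population means and variances for X and Y.
    mu_x = fmean(x1) - w2 * g1 * d_mean
    mu_y = fmean(y2) + w1 * g2 * d_mean
    var_x = _var(x1) - w2 * g1**2 * d_var + w1 * w2 * g1**2 * d_mean**2
    var_y = _var(y2) + w1 * g2**2 * d_var + w1 * w2 * g2**2 * d_mean**2

    slope = (var_y / var_x) ** 0.5
    intercept = mu_y - slope * mu_x
    return slope, intercept
```

Under these assumptions, a new-form raw score x is placed on the link-form scale as slope * x + intercept. Discarding common items changes v1 and v2, and therefore the regression slopes and group differences, while the total scores x1 and y2 are unchanged.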

The equating results using the full set of common items provided baseline data that were 
compared with results using sets of common items from which items were discarded. 
Twelve common item sets were manipulated: six sets each discarded five equators
(8% of the total number of common items) and six sets each discarded ten equators
(17% of the total number of common items). Discarded items were included in
the computations for whole test results, but were excluded from the computations for the 
common item equating set. 
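
The scoring rule just described can be sketched as follows. This is a minimal illustration that assumes dichotomously scored (0/1) responses keyed by item identifier; the function name and data layout are hypothetical and are not taken from the study.

```python
def score_examinees(responses, common_items, discarded):
    """Return (total_scores, anchor_scores) for scored 0/1 responses.

    responses: list of dicts mapping item id -> 0 or 1, one dict per examinee.
    common_items: ids of the original common (equating) items.
    discarded: ids of common items dropped from the equating set.
    """
    dropped = set(discarded)
    kept = [item for item in common_items if item not in dropped]
    totals, anchors = [], []
    for resp in responses:
        # Discarded items still count toward the whole-test raw score.
        totals.append(sum(resp.values()))
        # Only the retained common items contribute to the anchor score.
        anchors.append(sum(resp[item] for item in kept))
    return totals, anchors
```

With this layout, the whole-test score is the same for every condition, and only the common-item score changes as equators are discarded.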

When items are removed from an equating set, the overall characteristics of the
item set change. Although many characteristics of the decreased item
sets could be studied, this research focused on examining changes in the following areas: 

■ Scale score moments; 

■ Conversion of raw scores to scale scores within the range of passing scores; 

■ Standard error of equating within the range of passing scores. 

Study Design 

Tables 1 and 2 list the general characteristics of the total test, the original item set, and 
the decreased item sets. Table 1 shows the characteristics of the common item sets that 
were decreased by five equators, and Table 2 shows the characteristics of the common
item sets decreased by 10 equators. The first column displays the characteristics of the 
total test followed by the original equating set of 30 items from one link form. The 
average difficulty (p-value) and standard deviation (SD) are displayed for each of the six
content areas. The remaining columns show characteristics of the item sets from which 
equators were discarded. 

Note that each decreased item set was manipulated with regard to the degree of 
representation of the six content areas and the level of item difficulty. In the past when 
items have been compromised, typically the items have represented different content 
areas rather than the same area. Consequently, for this study items were discarded from 
more than one content area rather than from only one content area. 

Differences between the original equating set and the decreased equating sets can be 
identified by comparing the number of items in the set and the average item difficulty by 
content area. In addition, the characteristics of the common item sets can be compared 
with the characteristics of the total test by content areas. 
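
A comparison of this kind could be computed along the following lines. The sketch assumes each item is described by a content area label and a p-value (proportion correct), and it assumes a population standard deviation; the source does not say which SD formula was used, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def difficulty_by_content_area(items):
    """Summarize item difficulty per content area.

    items: iterable of (content_area, p_value) pairs, where p_value is the
    proportion of examinees answering the item correctly.
    Returns {content_area: (n_items, mean_p, sd_p)}.
    """
    grouped = defaultdict(list)
    for area, p_value in items:
        grouped[area].append(p_value)
    # pstdev is the population SD (an assumption); it is 0.0 for a single item.
    return {
        area: (len(ps), fmean(ps), pstdev(ps)) for area, ps in grouped.items()
    }
```

Running the function on the original equating set and on a decreased set gives the item counts and mean difficulties that can be compared area by area.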

Given the relatively small number of items discarded from each set, the content 
representation and average level of item difficulty could vary only moderately for each 
decreased equating set. Working within these limitations, items were discarded from one 
link form to create equating sets with the following characteristics: 

Level of content representation for decreased common item sets 

Content representation - items were discarded somewhat
equally across all content areas

■ 5 items were discarded - Sets A, B

■ 10 items were discarded - Sets G, H

Content non-representation - items were discarded somewhat
unequally across the content areas

■ 5 items were discarded - Sets C, D, E, F 

Most items were discarded from only two of the six content areas 

■ 10 items were discarded - Sets I, J, K, L 

Most items were discarded from only three of the six content areas 

Level of statistical representation for decreased common item sets 

Statistical representation - p-values were approximately the same for the original and 
decreased sets of common items 

■ Level of difficulty by content area was similar to the level of difficulty for the 
original equating set 

Statistical non-representation - p-values varied somewhat for the original set and 
decreased set of common items 

■ Some content areas were more difficult than the original equating set 

■ Some content areas were less difficult than the original equating set 
Table 1

Table 2

Characteristics of Total Test, Original Equating Set and Decreased Equating Sets 

(Discard 10 items in each set)

Findings

The raw scores for the exam under investigation ranged from 1 to 200. For reporting 
purposes, the raw scores were linearly converted to scale scores that ranged from 0 to 
200. Different equating methods were used to analyze the data, and all methods produced 
very similar results when compared with the original outcomes. Therefore, the results
reported here are based on one equating method rather than on all of the
equating methods used.

Table 3 lists the scale score moments of the original set of common items and the 12 
decreased common item sets. Also listed are the standard deviation, skewness, and 
kurtosis measures for each equating set. In general, regardless of the number of items 
discarded, the scale score moments from each set were quite similar to the original 
results. The extent to which the sets were content and statistically non-representative did 
not influence the results. 

No universal passing score was established for the exam program being analyzed. For the 
purpose of the study, several passing scores were considered within the range of 130 and 
135 scale score points. Of primary interest were the variations in the scale scores 
produced by the decreased common item sets within the range of passing scores. The 
most notable finding of the study is the small difference, if any, in the scale scores 
produced using the original set of 30 common items and those using sets decreased by 
either 5 or 10 items. Tables 4 and 5 compare differences in scale scores within the range 
of passing scores for the original equating set and the twelve decreased equating sets. 

The scale scores computed using the decreased sets of equators were either unchanged, or 
one point higher or lower than the scale scores computed using the original set of 
equators. Of the 84 total scale scores computed under the twelve conditions (7 passing
scores x 12 reduced equator sets), 74 scores remained the same as the original scale score,
8 scores increased by one point, and 2 scores decreased by one point. The equating sets 
that discarded five items produced four scale scores that differed from the original scale 
scores (three scores +1 and one score -1). The equating sets that discarded ten items 
produced six scale scores that differed from the original scale scores (five scores +1 and 
one score -1). 

The scale score fluctuations in nearly all of the decreased equating sets affected the 
passing scores of approximately 2% of the examinees, either increasing or decreasing the
required scale score by one point. For equating set G, approximately 4% of the examinees
were affected because two scale scores varied rather than just one. 
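
One way to quantify such an effect is to count the examinees whose pass/fail decision would differ under two raw-to-scale conversions. The sketch below is illustrative only: the conversion tables are assumed to be simple lookups from raw score to scale score, and the function name and example values are hypothetical rather than taken from the study's tables.

```python
def percent_status_changed(raw_scores, original_conv, reduced_conv, cut):
    """Percent of examinees whose pass/fail decision differs between conversions.

    raw_scores: one raw score per examinee.
    original_conv, reduced_conv: dicts mapping raw score -> scale score.
    cut: scale score required to pass.
    """
    changed = sum(
        (original_conv[raw] >= cut) != (reduced_conv[raw] >= cut)
        for raw in raw_scores
    )
    return 100.0 * changed / len(raw_scores)
```

If roughly 2% of examinees sit at a raw score whose scale score equivalent moves across the cut, the function returns about 2; when two adjacent score points move, the affected share is roughly doubled.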

Discrepancies in scale scores could be the result of changes in the overall level of 
difficulty of the decreased equating sets relative to the mean difficulty level of the 
original equating set. Note that the level of difficulty (p-value) of the decreased equating
sets was manipulated to modestly increase, decrease, or remain about the same as the 
original equating set. However, no pattern appeared in the scale scores that could be 
associated with the effect of varying the level of difficulty of the decreased equating sets. 
In addition, the representation of the content areas of the decreased equating sets was
manipulated to produce somewhat representative or non-representative sets. Again, no 
patterns appeared in the scale scores that could be associated with the effect of varying 
the content representation of the decreased equating sets. 

Sampling error also could have caused variations in outcomes such that scale scores
would differ slightly, depending upon the sample analyzed. The variations in scale scores
were so small, however, that they were similar to the fluctuations produced when the
original common item set was equated with a variety of recommended equating methods.

The standard error of equating (SEE) is an index that estimates the amount of random 
error in the equating process. Kolen and Brennan (1995) wrote a detailed summary of 
SEE calculation for different equating designs, one of which was implemented for the
study. Tables 6 and 7 show the SEE estimates when different sets of common items were 
used. Error estimates near the passing scores are of primary interest. Note the small 
difference in equating error estimates between those scores computed using the original 
set of equators and those using the decreased sets of equators. As would be expected, 
equating error estimates were lowest when using 30 items and increased slightly when 
using fewer items. Once again, content and statistical non-representativeness
did not influence SEE. Differences in SEE were quite small and within the range
typically reported for various administrations of the examination. 
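
The study computed SEE from formulas summarized by Kolen and Brennan (1995). For readers who want an intuition for what the index measures, the sketch below uses a different and commonly used alternative, the bootstrap: examinees are resampled with replacement, the equating is recomputed, and the standard deviation of the equated score across replications serves as the SEE estimate. All names are hypothetical, and the mean-and-SD matching function is only a simple stand-in for an equating method, not one of the three methods used in the study.

```python
import random
from statistics import fmean, pstdev


def bootstrap_see(group1, group2, equate, raw_score, reps=200, seed=0):
    """Bootstrap estimate of the standard error of equating at raw_score.

    group1, group2: lists of examinee records for the new and link forms.
    equate: callable(group1, group2) -> (slope, intercept); a hypothetical hook.
    """
    rng = random.Random(seed)
    equated = []
    for _ in range(reps):
        # Resample examinees with replacement within each group.
        sample1 = rng.choices(group1, k=len(group1))
        sample2 = rng.choices(group2, k=len(group2))
        slope, intercept = equate(sample1, sample2)
        equated.append(slope * raw_score + intercept)
    # The SD of the equated score over replications estimates the SEE.
    return pstdev(equated)


def mean_sigma_equate(x_scores, y_scores):
    """Stand-in linear equating: match the means and SDs of two score lists."""
    slope = pstdev(y_scores) / pstdev(x_scores)
    return slope, fmean(y_scores) - slope * fmean(x_scores)
```

Because the estimate reflects sampling variability, it shrinks as the number of examinees grows, which is one reason samples of at least 10,000 per form yield small SEE values.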
Table 3

Scale Score Moments of Original Equating Set and Decreased Equating Sets

Table 4

Scale Scores within the Range of Passing Scores 
Comparison of Original Equating Set with Decreased Equating Sets 
(Discard 5 items in each set)

Standard Error of Equating Estimates

Table 7

Standard Error of Equating (SEE) within the Range of Passing Scores 
Comparison of Original Equating Set with Decreased Equating Sets 
(Less 10 items in each set)

Conclusions

The purpose of this study was to use “real” data to examine the impact of equating with 
fewer items than required by a predetermined equating plan. The study was designed to 
examine the effect on reported scores of equating with sets of common items that were 
smaller and less representative of the content and difficulty level than the original, 
optimal set of common items. The most notable finding was the small difference, if any, 
in the scale scores produced using the original set of common items and those using sets 
decreased by 5 and 10 items. 

Findings suggest that the equating plan designed for this particular licensure/certification 
examination was highly robust. The equating design, which used 2 link forms with 30 
common items from each link, provided an adequate pool that withstood the impact of 
discarding 5 and 10 items from one link. The design seemed to create a buffer that 
produced satisfactory equating results under less than optimal conditions, such as those 
caused by security breaches and content changes. 

The effects of varying numbers of items discarded from the common item set.

Equating with five and ten fewer items than the original 60 common items seemed to 
have affected scale scores only slightly. Some scale scores remained unchanged, while 
other scores increased or decreased by only one point. Evidence based on the scale score
moments supports the consistency of scale scores across the common item sets. The
number of discarded items had minimal impact on the resulting standard error of equating,
which stayed within the range of SEE typically exhibited in multiple administrations of the
exam. The standard error of equating remained reasonable, although it was slightly higher
for common item sets from which equators were discarded than for the original common
item set.

The effects of discarding items that result in the common item set reflecting less 
accurately the statistical specifications of the total test. Findings suggest that equating 
with a set of common items that does not exactly reflect the level of item difficulty of the 
total test has minimal impact on the resulting scale scores. 

The effects of discarding items that result in the common item set reflecting less 
accurately the content areas of the total test. Findings suggest that equating with a 
common set of items that does not equally represent the content areas of the total test has 
minimal impact on the resulting scale scores. 

The findings from this study are consistent with results from Gao, Hanson and Harris 
(1999), which showed that content non-representativeness for the common item set did 
not influence equating accuracy. Moreover, results from this study showed that statistical 
non-representativeness did not greatly influence the equating results, though larger 
numbers of discarded items increased equating error. 

Pass/fail decisions for high stakes licensure and certification examinations usually require 
multiple types of evaluation. Consequently, a one-point fluctuation in the scale score of 
one examination that is combined with scores from other assessments most likely would 
have minimal impact on the overall pass/fail rate. However, if scores from multiple forms
of evaluation were scaled to the examination under study, the impact of a one-point 
fluctuation in the scale score could be compounded. 

The results of this study suggest that when an equating plan is well designed, the equating 
process can withstand discarding compromised items without severely jeopardizing exam 
results. The fact that the equating design and methodology appear robust suggests that
more complicated procedures, such as weighting the remaining items to obtain content 
representativeness, might not be necessary. Obviously, dramatic events could produce 
profound consequences on equating outcomes, but experiences to date would realistically 
suggest that discarding a limited number of common items is the most extreme 
consequence of managing compromised items. This finding should be reassuring to those 
in charge of operational testing programs that implement a well-constructed, double link 
equating design. 